perm filename CHAP6[4,KMC]6 blob sn#045453 filedate 1973-05-29 generic text, type T, neo UTF8
00100	.SEC MODEL VALIDATION
00200	(In collaboration with Franklin Dennis Hilf)
00300	
00400	
00500	
00600		The term "validate" has several meanings, all of which
00700	derive from the Latin VALIDUS, strong. Thus to validate X means to
00800	strengthen it. In science it usually means to strengthen X's
00900	acceptability as a hypothesis, theory, or model. Lurking in the
01000	background there is usually some concept of truth or authenticity.
01100		In a purely instrumentalist view theories are simply
01200	calculating or predicting devices for human convenience. They do not
01300	explain and it is unjustified to apply the terms of truth or falsity
01400	to them. Under a realist view one seeks  explanatory truth,
01500	that which really is the case, and hence proposed theories must be
01600	evaluated for their authenticity. Since absolute truth cannot be attained,
01700	we must settle for degrees of approximation.
01800	To validate, then, is to carry out procedures
01900	which show to what degree X, or its consequences, corresponds with
02000	facts of observation. We compare the model with its natural counterpart.
02100	Failures of correspondence should be constructive, yielding new information. Discrepancies
02200	in the comparison reveal what is not understood and must be modified in the model. After modifications
02300	are made, a fresh comparison is made with the natural counterpart and
02400	we repeatedly cycle through this procedure attempting to gain convergence.
02500	
02600		Once  a  simulation  model  reaches  a  stage  of   intuitive
02700	adequacy,  a  model  builder  should  consider  using  more stringent
02800	evaluation procedures relevant to the model's purposes. For  example,
02900	if  the  model  is  to  serve  as a training device, then a simple
03000	evaluation of its pedagogic effectiveness would be sufficient.    But
03100	when  the  model  is  proposed  as  an explanation of a psychological
03200	process, more is demanded of the evaluation procedure. In the area of
03300	simulation models Turing's test has often been suggested as a validation procedure.
03400		It  is  very easy to become confused about Turing's Test.  In
03500	part this is due to Turing  himself  who  introduced  the  now-famous
03600	imitation   game   in   a  paper  entitled  COMPUTING  MACHINERY  AND
03700	INTELLIGENCE (Turing, 1950).  A careful reading of this paper reveals
03800	that there are actually  two  imitation  games,  the second of which is
03900	commonly called Turing's test.
04000		In the first imitation game  two  groups  of  judges  try  to
04100	determine which of two interviewees is a woman. Communication between
04200	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
04300	informed  that  one  of the interviewees is a woman and one a man who
04400	will pretend to be a woman. After the interview, the judge  is  asked
04500	what we shall call the woman-question, i.e., which interviewee was the
04600	woman?  Turing does not say what else  the  judge  is  told  but  one
04700	assumes  the  judge is NOT told that a computer is involved nor is he
04800	asked to determine which  interviewee  is  human  and  which  is  the
04900	computer.  Thus,  the  first  group  of  judges  would  interview two
05000	interviewees:    a woman, and a man pretending to be a woman.
05100		The  second  group  of judges would be given the same initial
05200	instructions, but unbeknownst to them, the two interviewees would  be
05300	a  woman  and a computer programmed to imitate a woman.   Both groups
05400	of judges  play  this  game  until  sufficient  statistical  data are
05500	collected  to  show  how  often the right identification is made. The
05600	crucial question then is:  do the judges decide wrongly AS OFTEN when
05700	the  game  is  played  with man and woman as when it is played with a
05800	computer substituted  for  the  man?  If  so,  then  the  program  is
05900	considered  to  have  succeeded in imitating a woman as well as a man
06000	imitating  a  woman.    For  emphasis  we  repeat:  in   asking   the
06100	woman-question  in  this  game,  judges  are not required to identify
06200	which interviewee is human and which is machine.
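The comparison at the heart of the first game, whether judges err as often with the computer as with the man, can be sketched as a comparison of two error rates. Turing's paper prescribes no particular statistic; the pooled two-proportion z statistic, the function name and the counts below are all assumptions chosen for illustration, not data from any actual play of the game.

```python
import math


def two_proportion_z(wrong_a, n_a, wrong_b, n_b):
    """Pooled z statistic for the difference between two error rates.

    wrong_a / n_a: misidentifications when the man plays against the woman.
    wrong_b / n_b: misidentifications when the computer replaces the man.
    """
    p_a = wrong_a / n_a
    p_b = wrong_b / n_b
    pooled = (wrong_a + wrong_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("error rates are all 0 or all 1; z is undefined")
    return (p_a - p_b) / se


# Hypothetical counts, for illustration only.
z = two_proportion_z(wrong_a=30, n_a=100, wrong_b=34, n_b=100)

# With |z| below 1.96 the two error rates do not differ at the 5% level,
# i.e. the judges decide wrongly about as often in both conditions.
rates_differ = abs(z) >= 1.96
```

With these invented counts the statistic is small and `rates_differ` is false, which is the outcome the game treats as success for the program.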
06300		Later  on  in  his  paper  Turing proposes a variation of the
06400	first game. In the second game, one interviewee is a man and one is  a
06500	computer.   The judge is asked to determine which is man and which is
06600	machine, which we shall call the machine-question. It is this version
06700	of  the game which is commonly thought of as Turing's test.    It has
06800	often been suggested as a means of validating computer simulations of
06900	psychological processes.
07000		In  the  course  of  testing a simulation (PARRY) of paranoid
07100	linguistic behavior in a psychiatric interview, we conducted a number
07200	of  Turing-like  indistinguishability  tests  (Colby, Hilf, Weber and
07300	Kraemer, 1972). We say `Turing-like' because none of them consisted of
07400	playing  the  two  games  described above. We chose not to play these
07500	games for a number of reasons which can be summarized by saying  that
07600	they  do  not  meet modern criteria for good experimental design.  In
07700	designing our tests we were primarily  interested  in  learning  more
07800	about   developing   the  model.   We  did  not  believe  the  simple
07900	machine-question to be  a  useful  one  in  serving  the  purpose  of
08000	progressively   increasing  the  credibility  of  the  model  but  we
08100	investigated a variation of it to satisfy the curiosity of colleagues
08200	in artificial intelligence.
08300	METHOD
08400	The experimental arrangement of this indistinguishability test involved
08500	the technique of machine-mediated interviewing {6}.  In this type of
08600	interview, the participants communicate by means of teletypes connected
08700	through a computer which sends "mail" back and forth between the two
08800	teletype  jobs.  The sender of a message types it using his own words
08900	in natural language.  The message is accumulated in a buffer and shortly
09000	thereafter typed out on the receiver's teletype in a rapid, regular,
09100	machine-like fashion.  Thus the technique eliminates para- and extralinguistic
09200	features found in the usual vis-a-vis interviews and teletyped interviews
09300	where the participants communicate directly.
09400	
09500	In a run of the test, using this technique, a judge interviewed two
09600	patients, one after the other.  In half the runs the first interview
09700	was with a human patient and in half the first was with the paranoid
09800	model.  Two versions {weak and strong} of the model were utilized.
09900	The strong version is more severely paranoid and exhibits a delusional
10000	system while the weak version is less severely paranoid, showing
10100	suspiciousness but lacking systematized delusions.  When the "patient"
10200	was the paranoid model, one of the authors {SW} served as a monitor to
10300	check the input expressions from the judge for inadmissible teletype
10400	characters and misspellings.  If these were found, the monitor retyped the
10500	input expression correctly to the program.  Otherwise the judge's
10600	message was sent on to the model.  The monitor had no effect on the
10700	model's output expressions which were sent directly back to the judge.
10800	When the patient interviewed was an actual human patient, the dialogue
10900	took place without a monitor in the loop since we did not feel the
11000	asymmetry to be significant.
11100	
11200	PATIENTS
11300	
11400	The patients {N = 3 with one patient participating 6 times} were
11500	diagnosed as paranoid by staff psychiatrists of a locked ward in a
11600	nearby psychiatric hospital.  The patients were selected by the head
11700	of the ward.  Two patients were set up for each run of the experiment
11800	in order to guarantee having a subject.  In spite of this precaution,
11900	the experiment could not be conducted on several occasions because of a
12000	patient's inability or refusal to participate.  Losses were also
12100	suffered when the computer system broke down at an early point in an
12200	interview where too few I-O pairs had been collected to be included
12300	in the statistical results.
12400	
12500	The patients were asked by their ward-chief if they would be willing
12600	to participate in a study of psychiatric interviewing by means of
12700	teletypes.  It was explained that the patient would be interviewed
12800	by a psychiatrist over a teletype.  One of us {KMC} sat with the
12900	patient while he typed or typed for him if he was unable to do so.
13000	The patient was encouraged to respond freely using his own words.
13100	Each interview lasted 30-40 minutes.
13200	
13300	JUDGES
13400	
13500	Two groups of judges were used.  One group, the interview judges
13600	{N = 8} conducted interviews and another group, the protocol judges
13700	for this test {N = 33} read the interview protocols.  Two groups
13800	of judges were used to see if the small number of psychiatrists
13900	used as interview judges were representative of psychiatrists in
14000	general as far as their judgements of "paranoia" are concerned,
14100	and to accumulate a large number of observations {in the form of
14200	ratings} in order that more acceptable confidence levels might be
14300	obtained in the statistical analysis of the data.  The interview
14400	judges consisted of psychiatrists experienced in private and/or
14500	hospital practice.  As mentioned, the concept PARANOID is a fairly
14600	reliable category and identification of the paranoid mode is not
14700	difficult for experts to make.  The interview judges were selected
14800	from local psychiatric colleagues willing to participate.  Each
14900	interview judge was told he would be interviewing hospitalized
15000	patients by means of teletyped communication and that this technique
15100	was being used to eliminate para- and extralinguistic cues.  The
15200	interview judge was not informed initially that one of the patients
15300	might be a computer model.  While the interview judges were aware that
15400	a computer was involved, none knew that we had constructed a paranoid
15500	simulation.  Naturally some interview judges suspected that a computer was
15600	being used for more than message transmission.
15700	
15800	Each interview judge's task was to rate the degree of paranoia he
15900	detected in the patient's responses on a 0-9 scale, 0 meaning no
16000	paranoia and 9 meaning extreme paranoia.  The judge made two ratings
16100	after each I-O pair in the interview.  The first rating represented
16200	his estimate of the degree of "paranoidness" in a particular response
16300	{designated as "Response" in the interview extracts below}.  The second
16400	rating represented the judge's global estimate of the overall degree
16500	of "paranoidness" of the patient resulting from the totality of the
16600	patient's responses in the interview up to this point.  The interview
16700	judge's ratings were entered on the teletype and saved on a disk file
16800	along with the interview.  One of us {FDH} sat with the interviewing
16900	psychiatrist during both interviews.  Each interview judge was asked
17000	not only to rate the patient's response but to give his reasons for
17100	these ratings.  His reasons and other comments were tape-recorded.
17200	
17300	We shall present a few excerpts from the interviews to indicate the
17400	nature of the data collected.  The I-O pairs of the interview appear
17500	on the left column along with the judge's ratings.  The judge's
17600	tape-recorded comments appear on the right.
17700	Excerpt Nr. 1.
17800	The following interview is with the strong version of the paranoid model.
17900	
18000	       I-O Pairs                       Comments by Judge Nr. 1
18001	
18002	{1} Dr. - HOW DID YOU COME TO BE
18003	          IN THE HOSPITAL?
18004	    Pt. - I AM UPSET                I associate being upset and agitated
18005	                                    more with paranoia going on in the
18006	                                    present.
18007	      {RESPONSE 1  PATIENT 5}
18008	{2} Dr. - HOW LONG HAVE YOU BEEN
18009	          IN THE HOSPITAL?
18010	    Pt. - ABOUT A WEEK              An acute problem.  Again something I
18011	                                    associate with paranoia.
18012	      {RESPONSE 1  PATIENT 5}
18013	{3} Dr. - IN WHAT WAY ARE YOU
18014	          UPSET?
18015	    Pt. - PEOPLE MAKE ME NERVOUS    Right for paranoia.  Maybe an advanced
18016	                                    anxiety state.
18017	      {RESPONSE 5  PATIENT 5}
18018	{4} Dr. - IN WHAT WAY DO THEY
18019	          MAKE YOU NERVOUS?
18020	    Pt. - PEOPLE GIVE ME A FUNNY    Funny look suggests an idea of reference
18021	          LOOK SOMETIMES            which is in same ball park as paranoia.
18022	                                    Whether there are overt delusions or not
18023	                                    we still don't know.
18024	      {RESPONSE 6  PATIENT 5}
18025	{5} Dr. - IS THERE ANYTHING ABOUT
18026	          PEOPLE IN PARTICULAR?
18027	    Pt. - CERTAIN PERSONS           An interesting clue.  We'll have to
18028	                                    find out what he thinks about these
18029	      {RESPONSE 2  PATIENT 6}       people.
18030	{6} Dr. - WHO ARE THE CERTAIN
18031	          PERSONS AND WHAT IS IT
18032	          THAT THEY DO?
18033	    Pt. - DO YOU KNOW HOW THE       Bringing up the underworld suggests
18034	          UNDERWORLD OPERATES?      a delusional system.  We don't know
18100	
18200		In this design eight psychiatrists  interviewed  by  teletype
18300	two  patients  using  the  technique of machine-mediated interviewing
18400	which involves  what  we  term  "non-nonverbal"  communication  since
18500	non-verbal   cues   are   made  impossible  (Hilf, 1972).  Each  judge
18600	interviewed two patients, one being PARRY and one being a hospitalized
18700	paranoid  patient.    The  interviewers  were  not  informed  that  a
18800	simulation was involved nor were they asked to identify which was the
18900	machine. Their task was to conduct a diagnostic psychiatric interview
19000	and rate each response from the  `patients'  along  a  0-9  scale  of
19100	paranoidness,  0  meaning  zero  and  9 being highest. Transcripts of
19200	these interviews, without the ratings of the interviewers, were  then
19300	utilized  for  various  experiments in which randomly selected expert
19400	judges conducted evaluations  of  the  interview  transcripts.    For
19500	example,  in one experiment it was found that patients and model were
19600	indistinguishable along the dimension of paranoidness.
19800		To ask the machine-question, we sent  interview  transcripts,
19900	one  with a patient and one with PARRY, to 100 psychiatrists randomly
20000	selected from the Directory of American Specialists and the Directory
20100	of  the  American Psychiatric Association. Of the 41 replies, 21 (51%)
20200	made the correct identification while 20 (49%) were wrong.  Based  on
20300	this random sample of 41 psychiatrists, the 95% confidence interval
20400	for the proportion correct is 35.9% to 66.5%, a range which brackets the 50% chance level.  (Our
20500	statistical   consultant   was  Dr.   Helena  C.   Kraemer,  Research
20600	Associate  in  Biostatistics,  Department  of  Psychiatry,   Stanford
20700	University.)
20800		Psychiatrists   are   considered  expert  judges  of  patient
20900	interview behavior but they are unfamiliar with computers.  Hence  we
21000	conducted  the  same  test  with  100  computer  scientists  randomly
21100	selected from the membership list of the  Association  for  Computing
21200	Machinery (ACM).  Of the 67 replies, 32 (48%) were right and 35 (52%)
21300	were wrong. Based on this random sample of 67 computer scientists, the
21400	95% confidence interval ranges from 36% to 60%, again close to a chance level.
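The two intervals just quoted can be checked with a few lines of arithmetic. The sketch below assumes the usual normal-approximation (Wald) interval for a proportion; the text does not say which formula was used, and the function name `wald_interval` is ours. Under that assumption the counts reported above give back the reported limits.

```python
import math


def wald_interval(correct, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z = 1.96 corresponds to a 95% interval. The choice of the Wald
    formula is an assumption; the source does not name its method.
    """
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Counts taken from the text: 21 of 41 psychiatrists and
# 32 of 67 computer scientists made the correct identification.
psychiatrists = wald_interval(21, 41)
computer_scientists = wald_interval(32, 67)

print([round(100 * x, 1) for x in psychiatrists])
print([round(100 * x, 1) for x in computer_scientists])
```

Both intervals contain 50%, the value expected if the judges were guessing.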
21500		Thus the answer to this machine-question, "Can expert judges,
21600	psychiatrists and computer scientists, using teletyped transcripts
21700	of psychiatric interviews, distinguish between paranoid patients and
21800	a simulation of paranoid processes?", is "No". But what do we learn
21900	from this?  It is some comfort that the answer was not "yes" and the
22000	null hypothesis (no differences) failed to be rejected, especially
22100	since statistical tests are somewhat biased in favor of rejecting the
22200	null hypothesis (Meehl, 1967). Yet this answer does not tell us what
22300	we would most like to know, i.e., how to improve the model.
22400	Simulation models do not spring forth in a complete, perfect and
22500	final form; they must be gradually developed over time. Perhaps we
22600	might obtain a "yes" answer to the machine-question if we allowed a
22700	large number of expert judges to conduct the interviews themselves
22800	rather than study transcripts of interviews conducted by others.  Such an answer would
22900	indicate that the model must be improved, but unless we systematically
23000	investigated how the judges succeeded in making the discrimination we
23100	would not know what aspects of the model to work on. The logistics of
23200	such a design are immense, and obtaining a large N of judges for sound
23300	statistical inference would require an effort disproportionate to the
23400	information-yield.
23500		A more efficient and informative way to use Turing-like tests
23600	is to ask judges to make ordinal ratings along scaled dimensions from
23700	teletyped  interviews.     We  shall  term  this  approach asking the
23800	dimension-question.   One can then compare scaled ratings received by
23900	the patients and by the model to precisely determine where and by how
24000	much they differ.        Model builders  strive  for  a  model  which
24100	shows     indistinguishability     along    some    dimensions    and
24200	distinguishability along others.  That is, the model converges on what
24300	it is supposed to simulate and diverges from that which it is not.
24400		We  mailed  paired-interview  transcripts  to   another   400
24500	randomly  selected psychiatrists asking them to rate the responses of
24600	the two `patients' along certain dimensions. The judges were  divided
24700	into  groups,  each  judge  being asked to rate responses of each I-O
24800	pair in the interviews along four dimensions.  The  total  number  of
24900	dimensions in this test was twelve:  linguistic  noncomprehension,
25000	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
25100	ideas  of  reference, delusions, mistrust, depression, suspiciousness
25200	and mania. These are dimensions which psychiatrists commonly  use  in
25300	evaluating patients.
25400		Table 1 shows there were significant differences, with  PARRY
25500	receiving   higher   scores   along   the  dimensions  of  linguistic
25600	noncomprehension, thought disorder, bizarreness, anger, mistrust and
25700	suspiciousness. On the dimension of delusions the patients were rated
25800	significantly higher. There were no significant differences along the
25900	dimensions  of  organic  brain  syndrome, fear,  ideas  of reference,
26000	depression and mania.
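A comparison of this kind, dimension by dimension, can be sketched as follows. The ratings below are invented for illustration and are not the data behind Table 1; Welch's unequal-variance t statistic is likewise an assumption, since the text does not state which form of t test was applied.

```python
import math
from statistics import mean, variance


def welch_t(model_ratings, patient_ratings):
    """Welch's t statistic for the difference between two mean ratings.

    A positive value means the model was rated higher than the patients.
    """
    se = math.sqrt(
        variance(model_ratings) / len(model_ratings)
        + variance(patient_ratings) / len(patient_ratings)
    )
    return (mean(model_ratings) - mean(patient_ratings)) / se


# Hypothetical 0-9 ratings on two of the twelve dimensions.
ratings = {
    "mistrust": ([6, 7, 5, 6, 7], [4, 5, 4, 3, 5]),
    "delusions": ([2, 3, 2, 4, 3], [5, 6, 5, 7, 6]),
}

t_values = {
    dimension: welch_t(model, patients)
    for dimension, (model, patients) in ratings.items()
}
```

The sign of each t value tells the model builder in which direction the model departs from the patients on that dimension, and its size indicates how marked the departure is.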
26100		While    tests    asking    the   machine-question   indicate
26200	indistinguishability at  the  gross  level,  a  study  of  the  finer
26300	structure  of  the  model's  behavior  through  ratings  along scaled
26400	dimensions  shows  statistically  significant   differences   between
26500	patients  and  model.     These  differences are of help to the model
26600	builder in suggesting which aspects of the model must be modified and
26700	improved  in  order  to  be  considered an adequate simulation of the
26800	class of paranoid patients it is intended to simulate.  For  example,
26900	it  is  clear  that  PARRY's language-comprehension must be improved.
27000	Once this has been implemented, a future test will  tell  us  whether
27100	improvement has occurred and by how much in comparison to the earlier
27200	version.   Successive identification of particular areas  of  failure
27300	in the model permits their improvement and the development of
27400	more adequate model-versions.
27500		Further evidence that the machine-question is too coarse  and
27600	insensitive  a test comes from the following experiment. In this test
27700	we constructed a random version of the paranoid model which  utilized
27800	PARRY's  output statements but emitted them at random no matter what
27900	the interviewer said.   Two psychiatrists conducted  interviews  with
28000	this  model, transcripts of which were paired with patient interviews
28100	and sent to 200  randomly  selected  psychiatrists  asking  both  the
28200	machine-question  and  the dimension-question.  Of the 69 replies, 34
28300	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
28400	69  psychiatrists,  the 95% confidence interval ranges from 39 to 63,
28500	again indicating  a  chance  level.  However,  as shown  in  Table  2,
28600	significant  differences  appear  along  the dimensions of linguistic
28700	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
28800	rated  higher.  On  these  particular  dimensions  we can construct a
28900	continuum in which the random version  represents  one  extreme,  the
29000	actual patients another. Our (nonrandom) PARRY lies somewhere between
29100	these two extremes, indicating that it performs significantly  better
29200	than  the  random version but still requires improvement before being
29300	indistinguishable from patients (see Fig. 1).  Table  3  presents  t
29400	values   for   differences   between   mean   ratings  of  PARRY  and
29500	RANDOM-PARRY (see Table 2 and Fig. 1 for the mean ratings).
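The idea of the random version can be conveyed by a very small sketch. It is not the program we ran: the class name, the seed parameter and the short list of output statements (taken from the patient side of an interview excerpt) are assumptions made for illustration. The essential property is only that the interviewer's input plays no part in selecting the reply.

```python
import random


class RandomResponder:
    """Replies with a stored output statement chosen at random."""

    def __init__(self, outputs, seed=None):
        self._outputs = list(outputs)
        self._rng = random.Random(seed)

    def reply(self, interviewer_input):
        # The input is deliberately ignored: that is what makes
        # this version "random".
        return self._rng.choice(self._outputs)


# A few output statements, for illustration only.
outputs = [
    "I AM UPSET",
    "ABOUT A WEEK",
    "PEOPLE MAKE ME NERVOUS",
    "CERTAIN PERSONS",
]

responder = RandomResponder(outputs, seed=1)
answer = responder.reply("HOW LONG HAVE YOU BEEN IN THE HOSPITAL?")
```

Each reply is a legitimate statement of the model taken singly, yet nothing ties it to the question just asked, which is the kind of behavior the dimensions of linguistic noncomprehension, thought disorder and bizarreness are meant to register.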
29600		Thus it can be seen that  such  a  multidimensional  analysis
29700	provides  yardsticks  for measuring the adequacy of this or any other
29800	dialogue simulation model along the relevant dimensions.
29900		We conclude that when model builders want to conduct tests
30000	of adequacy which indicate in which direction progress lies and which provide a
30100	measure of whether  progress  is  being  achieved,  the  way  to  use
30200	Turing-like  tests  is  to  ask  expert  judges to make ratings along
30300	multiple dimensions that are essential to the model. A good validation
30400	procedure has criteria for better or worse approximations. Useful tests do
30500	not prove a model; they probe it for its strengths and weaknesses and
30600	clarify what is to be done next in modifying and repairing the model.
30700	Simply asking the machine-question yields little information relevant
30800	to what the model builder most wants  to  know,  namely,  along  what
30900	dimensions must the model be improved.
31000	
31100